Text Categorization using the Semi-Supervised Fuzzy c-Means Algorithm
Abstract
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Over the past few years, TC has become very important, especially in Information Retrieval, where information needs have grown tremendously with the rapid growth of textual information sources such as the Internet. In this paper, we compare two partially supervised (or semi-supervised) clustering algorithms for text categorization: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm [1] and the Semi-Supervised Fuzzy c-Means (ssFCM) algorithm [2]. This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments use the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC [3]. Finally, ssFCM results in improved performance and faster execution time as more weight is given to training documents.
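The abstract does not spell out the ssFCM update rules, so the following is only a minimal sketch of one common semi-supervised fuzzy c-means formulation, not necessarily the exact algorithm of [2]. It assumes that labeled (training) documents keep fixed crisp memberships, that their contribution to the cluster centers is scaled by a hypothetical `weight` parameter, and that unlabeled (test) documents receive the standard FCM membership update. The function name `ssfcm` and all parameter names are illustrative.

```python
import math


def ssfcm(labeled, labels, unlabeled, n_clusters=2, m=2.0, weight=1.0, n_iter=50):
    """Sketch of a semi-supervised fuzzy c-means (assumed formulation).

    labeled   : feature vectors with known class
    labels    : class index (0..n_clusters-1) for each labeled vector
    unlabeled : feature vectors to be partitioned
    Assumes every class has at least one labeled example.
    Returns (centers, memberships); memberships[k][i] is the degree to
    which unlabeled vector k belongs to cluster i.
    """
    dim = len(labeled[0])
    # Labeled points keep crisp memberships: 1 for their class, 0 elsewhere.
    u_lab = [[1.0 if y == i else 0.0 for i in range(n_clusters)] for y in labels]
    # Unlabeled points start with uniform memberships.
    u_unl = [[1.0 / n_clusters] * n_clusters for _ in unlabeled]
    centers = []

    for _ in range(n_iter):
        # Center update: weighted mean over labeled and unlabeled points.
        centers = []
        for i in range(n_clusters):
            num = [0.0] * dim
            den = 0.0
            for x, u in zip(labeled, u_lab):
                c = weight * u[i] ** m  # labeled points are scaled by `weight`
                den += c
                for d in range(dim):
                    num[d] += c * x[d]
            for x, u in zip(unlabeled, u_unl):
                c = u[i] ** m
                den += c
                for d in range(dim):
                    num[d] += c * x[d]
            centers.append([v / den for v in num])

        # Membership update for unlabeled points only (standard FCM rule).
        for k, x in enumerate(unlabeled):
            dists = [math.dist(x, c) for c in centers]
            if min(dists) == 0.0:
                # Point coincides with a center: assign it crisply.
                hit = dists.index(0.0)
                u_unl[k] = [1.0 if i == hit else 0.0 for i in range(n_clusters)]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                u_unl[k] = [v / total for v in inv]

    return centers, u_unl


# Toy usage: two labeled points, four unlabeled points in two obvious groups.
centers, memberships = ssfcm(
    labeled=[[0.0, 0.0], [10.0, 10.0]],
    labels=[0, 1],
    unlabeled=[[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]],
    weight=2.0,
)
```

In this sketch, raising `weight` pulls each center toward its labeled examples, which is one way to read the abstract's remark about giving more weight to training documents. For a binary decision per category, a test document can be assigned to the cluster with the larger membership.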
Similar resources
A Fuzzy Semi-Supervised Support Vector Machines Approach to Hypertext Categorization
Hypertext/text domains are characterized by several tens or hundreds of thousands of features. This represents a challenge for supervised learning algorithms which have to learn accurate classifiers using a small set of available training examples. In this paper, a fuzzy semi-supervised support vector machines (FSS-SVM) algorithm is proposed. It tries to overcome the need for a large labelled t...
Document Clustering Based On Semi-Supervised Term Clustering
The study proposes a multi-step feature (term) selection process and, in a semi-supervised fashion, provides initial centers for term clusters. It then uses the fuzzy c-means (FCM) clustering algorithm to cluster terms. Finally, each document is assigned to its closest associated term clusters. While most text clustering algorithms directly use documents for clustering, we propose to f...
A fuzzy semi-supervised support vector machine approach to hypertext categorization
Hypertext/text domains are characterized by several tens or hundreds of thousands of features. This represents a challenge for supervised learning algorithms which have to learn accurate classifiers using a small set of available training examples. In this paper, a fuzzy semi-supervised support vector machines (FSS-SVM) algorithm is proposed. It tries to overcome the need for a large labelled t...
Semi-supervised Text Categorization Using Recursive K-means Clustering
In this paper, we present a semi-supervised learning algorithm for classification of text documents. A method of labeling unlabeled text documents is presented. The presented method is based on the divide-and-conquer principle. It uses a recursive K-means algorithm for partitioning both labeled and unlabeled data collection. The K-means algorithm is applied recursively on each partiti...
Improve Semi-Supervised Fuzzy C-means Clustering Based On Feature Weighting
Semi-supervised learning is somewhere between unsupervised and supervised learning. In fact, most semi-supervised learning strategies are based on extending either unsupervised or supervised learning to include additional information typical of the other learning paradigm. Constraint fuzzy c-means is a novel semi-supervised fuzzy c-means algorithm proposed by Li et al. [1]. Constraint FCM like FCM ...